Performance Analysis and Optimization of Asynchronous Circuits

Steven  M.  Burns  and  Alain  J.  Martin 

Computer  Science  Department 
California  Institute  of  Technology 
Pasadena,  CA  91125  USA 
{steveb, alain}@vlsi.cs.caltech.edu

Abstract 

We  present  a  method  for  analyzing  the  time  performance  of  asynchronous 
circuits, in particular, those derived by program transformation from concurrent
programs using the synthesis approach developed by the second author.
The  analysis  method  produces  a  performance  metric  (related  to  the  time 
needed  to  perform  an  operation)  in  terms  of  the  primitive  gate  delays  of  the 
circuit.  Such  a  metric  provides  a  quantitative  means  by  which  to  compare 
competing  designs.  Because  the  gate  delays  are  functions  of  transistor  sizes, 
the  performance  metric  can  be  optimized  with  respect  to  these  sizes.  For 
a  large  class  of  asynchronous  circuits — including  those  produced  by  using 
our  synthesis  method — these  techniques  produce  the  global  optimum  of  the 
performance  metric.  A  CAD  tool  has  been  implemented  to  perform  this 
optimization. 


1  Introduction 

Performance  analysis  of  a  synchronous  computer  system  is  simplified  by  an 
external  clock  that  partitions  the  events  in  the  system  into  discrete  segments. 
In  asynchronous  systems,  no  such  quantization  exists.  Instead,  the  operation 
of  the  system  proceeds  at  a  rate  determined  by  the  speed  of  its  individual 
components,  and  the  sequencing  of  the  operation  of  the  components.  Unlike 
the synchronous case, the time needed to perform an asynchronous computation
cannot be determined by merely counting the number of clock cycles
required  and  multiplying  by  the  clock  period.  Instead,  to  determine  the 
time  required  to  perform  the  computation  as  a  whole,  the  times  of  those 
individual  components  of  the  computation  that  must  occur  sequentially  are 
summed. 

The  techniques  required  to  analyze  asynchronous  systems  resemble  those 
used  to  determine  the  clock  period  of  a  synchronous  system,  that  is,  summing 
the  delays  along  the  longest  path  through  the  combinational  logic  connecting 
adjacent  latches.  In  the  clocked  case,  the  critical  path  has  a  clear  beginning 
and a clear end because all paths are broken by latches. No clear separation
is available in asynchronous systems. Analysis procedures must deal directly
with  cyclic  critical  paths;  thus,  existing  critical-path  analysis  tools  such  as 
CRYSTAL  [11]  cannot  be  easily  applied  to  this  problem. 

This  paper  discusses  a  framework  for  determining  the  time  needed  to 
perform  computations  using  asynchronous  systems,  and  applies  especially 
to repetitive computations. Early work in the scheduling of concurrent computing
elements [14] is closely related to our approach. Previous work in
the  area  of  timed  Petri  nets  [13,  6]  applies  to  this  problem  as  well.  The 
results we describe here are based on event-rule systems, a different formalism
that is more closely connected to the methods we use to synthesize the
asynchronous  systems.  Furthermore,  we  use  our  formalism  to  model  the 
performance  of  asynchronous  circuits,  and  provide  a  method  for  optimizing 
such  circuits  for  performance. 

Martin  ([9]  and  elsewhere)  has  developed  a  synthesis  method  whereby 
asynchronous  circuits  are  produced  from  concurrent  program  descriptions. 
By  applying  a  systematic  series  of  semantics-preserving  transformations,  a 
high-level  description  (CSP  program)  is  refined,  using  the  intermediate  forms 
of  handshaking  expansions  and  production  rules,  until  a  provably  correct 
asynchronous  CMOS  circuit  is  constructed. 

At  each  stage  of  the  synthesis  procedure,  a  variety  of  transformations 
can  potentially  be  applied.  In  the  automated  compiler  of  [2],  these  choices 
are  made  so  that  the  same  subcircuit  template  can  be  used  to  implement 
each  instance  of  the  same  CSP  language  construct.  Instances  of  these  small 
templates  are  composed  together  to  form  a  correct  circuit  that  implements 
the  original  CSP  program.  However,  in  order  to  produce  high-performance 
circuits, these choices must be directed by performance concerns. We observed
this potential benefit of performance-directed transformations during
the design of the Caltech Asynchronous Microprocessor [8]. The decisions
about which transformations to apply were based on performance goals, and this
accounts for its high performance.

Event-rule (ER) systems can be used at each stage of the synthesis procedure
to analyze the potential performance of the current refinement. Given a
trace of the execution of a complete, closed program (environment included),
an ER system can be generated from any of the intermediate forms: CSP
programs, handshaking expansions, production rules, or CMOS circuits. The
trace of execution is used to unroll each process that contains guarded commands
into a straight-line process. In the cases where the trace of execution
repeats,  a  repetitive  ER  system  can  be  generated.  The  cycle  period  (the  time 
between  repeated  events)  can  be  determined  using  the  techniques  explained 
in  Section  2. 

These  techniques  provide  an  expression  for  the  cycle  period  in  terms  of 
maximums  and  sums  of  individual  component  delays.  At  the  circuit  level, 
the  component  delays  are  functions  of  transistor  widths  and,  as  such,  the 
cycle  period  can  be  optimized  with  respect  to  these  widths.  Non-linear 
optimization methods (such as those used in TILOS [3] and COP [7]) can
be used to perform the optimization of this expression for the cycle period.
Our  approach  differs  from  those  used  for  synchronous  systems  because  we 
optimize  all  critical  paths  simultaneously. 


2  Event-Rule  Systems 

An event-rule (ER) system is a pair $(E, R)$, where:

$E$ is a set of events, and

$R$ is a set of rules defining timed constraints between the events. Each
$r \in R$ is written $e \stackrel{\alpha}{\mapsto} f$, where

$e \in E$ is the source of $r$,
$f \in E$ is the target of $r$, and
$\alpha \in [0, +\infty)$ is the delay of $r$.

Neither  E  nor  R  need  be  finite.  When  R  is  infinite,  we  require  that  no 
event  depend  on  an  infinite  number  of  other  events.  That  is,  the  set  of 
predecessors  (immediate  or  otherwise)  of  an  event  must  be  finite.  Sometimes 
it is convenient to view $(E, R)$ as a directed graph (multiple arcs and self-loops
allowed); this graph will be referred to as the constraint graph $G$. For
a given $(E, R)$, there is a (possibly empty) set of functions, $T$, that satisfies:

$T$ is a subset of the functions from $E$ to $[0, +\infty)$;

$t \in T$ if and only if

$$t(f) \ge t(e) + \alpha \quad \text{for every } e \stackrel{\alpha}{\mapsto} f \in R. \tag{1}$$

We  call  a  function  t  in  the  set  T  a  timing  function  of  (E,  R).  Each  t  represents 
a  possible  or  consistent  timing  specification  for  the  events  of  the  system. 
If  the  set  T  is  empty,  the  constraints  (1)  cannot  be  satisfied  by  any  such 
function t. In this case, $(E, R)$ is called infeasible; otherwise, it is called
feasible. 

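Example 2.1 Consider the $(E, R)$ with

$$E = \{a, b, c\}, \qquad R = \{a \stackrel{\alpha_a}{\mapsto} b,\; b \stackrel{\alpha_b}{\mapsto} a,\; b \stackrel{\alpha_c}{\mapsto} c\}.$$

This ER system is feasible if and only if $\alpha_a = 0$ and $\alpha_b = 0$. □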

The  smallest  timing  function  denotes  the  earliest  time  at  which  the  events 
of  E  can  execute,  and  thus  corresponds  to  the  observed  execution  times  of 
real  circuits.  Any  feasible  ER  system  with  a  constraint  graph  containing 
cycles can be transformed into an equivalent acyclic system [1].

Lemma 2.1 If the constraint graph $G$ of an event-rule system $(E, R)$ is
acyclic, then there exists a unique function $\hat{t} \in T$ such that for every $t \in T$,

$$\hat{t}(e) \le t(e) \quad \text{for every } e \in E. \tag{2}$$

We call $\hat{t}$ the timing simulation of $(E, R)$.

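Proof: We propose the following recursive definition for $\hat{t}$:

$$\hat{t}(f) = \begin{cases} 0 & \text{if } \mathit{sources}(f) = \emptyset \\ \max\{\hat{t}(e) + \alpha \mid e \stackrel{\alpha}{\mapsto} f \in R\} & \text{otherwise.} \end{cases} \tag{3}$$

We can show, by contradiction, that $\hat{t}$ is the smallest timing function. (See
[1] for a complete proof.) ∎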

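Example 2.2 The ER system defined by the constraint graph:

[Constraint graph: arcs $a \to b$ with delay $\alpha_{ab}$, $e \to b$ with delay $\alpha_{eb}$, $b \to c$ with delay $\alpha_{bc}$, and $c \to d$ with delay $\alpha_{cd}$.]

has the timing simulation:

$$\begin{aligned}
\hat{t}(a) &= 0 \\
\hat{t}(e) &= 0 \\
\hat{t}(b) &= \max(\alpha_{ab}, \alpha_{eb}) \\
\hat{t}(c) &= \max(\alpha_{ab}, \alpha_{eb}) + \alpha_{bc} \\
\hat{t}(d) &= \max(\alpha_{ab}, \alpha_{eb}) + \alpha_{bc} + \alpha_{cd}
\end{aligned}$$

□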
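The recursive definition (3) translates almost directly into code. The following is a minimal sketch, not the authors' tool: the function name `timing_simulation`, the triple encoding of rules, and the numeric delays are assumptions chosen for illustration, and the constraint graph is assumed to be acyclic as in Lemma 2.1.

```python
from functools import lru_cache


def timing_simulation(events, rules):
    """Return the smallest timing function of an acyclic ER system.

    events: iterable of hashable event names.
    rules:  list of (source, target, delay) triples, one per rule.
    The constraint graph is assumed to be acyclic.
    """
    incoming = {e: [] for e in events}
    for source, target, delay in rules:
        incoming[target].append((source, delay))

    @lru_cache(maxsize=None)
    def t_hat(f):
        # Definition (3): 0 for events without sources, otherwise the
        # latest "source time plus delay" over all incoming rules.
        if not incoming[f]:
            return 0.0
        return max(t_hat(e) + delay for e, delay in incoming[f])

    return {e: t_hat(e) for e in incoming}


# The graph of Example 2.2 with illustrative (assumed) delay values:
# alpha_ab = 2, alpha_eb = 3, alpha_bc = 1, alpha_cd = 4.
rules = [("a", "b", 2.0), ("e", "b", 3.0), ("b", "c", 1.0), ("c", "d", 4.0)]
t = timing_simulation("aebcd", rules)
```

With these assumed delays, `t["b"]` is the larger of the two incoming delays, and `t["c"]` and `t["d"]` add the delays along the chain, matching the closed forms of Example 2.2.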


2.1  Repetitive  Systems 

ER  systems  of  unbounded  size  corresponding  to  asynchronous  circuits  can 
be  generated  from  bounded  structures.  Consider  the  event  set  E  generated 
from  a  finite  set  E'  by 


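$$E = E' \times \mathbb{N}.$$

The elements of $E'$ are called transitions. An event $\langle u, i \rangle \in E$ is the indexed
occurrence of the transition $u \in E'$. The non-negative integer $i$ is called the
occurrence index.

The rule set $R$ is also generated from a finite set $R'$. The elements of $R'$
are quadruples

$$r = (u, v, \alpha, \varepsilon) \in R', \quad \text{where } R' \subseteq E' \times E' \times [0, +\infty) \times \mathbb{Z},$$

which we will write as

$$r = \langle u, i - \varepsilon \rangle \stackrel{\alpha}{\mapsto} \langle v, i \rangle.$$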

The integer $\varepsilon$ is called the occurrence-index offset of $r$. The dummy variable
$i$ is replaced by a non-negative integer no less than $\varepsilon$ when $r$ is instantiated
(an infinite number of times) to form the generated rule set $R$. We require
$i \ge \max(0, \varepsilon)$ so that the occurrence indices of the source and the
target events of the instantiated rule are both non-negative and thus in $E$.
We call $(E, R)$ the (general) ER system generated from the repetitive ER
system $(E', R')$.


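Example 2.3 Consider the repetitive ER system constructed from a circuit containing
a single Muller C-element:

$$E' = \{x\uparrow, x\downarrow, y\uparrow, y\downarrow, z\uparrow, z\downarrow\}$$

$$\begin{aligned}
R' = \{\; & \langle x\downarrow, i-1 \rangle \stackrel{\alpha_{x\downarrow z\downarrow}}{\mapsto} \langle z\downarrow, i \rangle, &
& \langle y\downarrow, i-1 \rangle \stackrel{\alpha_{y\downarrow z\downarrow}}{\mapsto} \langle z\downarrow, i \rangle, \\
& \langle z\downarrow, i \rangle \stackrel{\alpha_{z\downarrow x\uparrow}}{\mapsto} \langle x\uparrow, i \rangle, &
& \langle z\downarrow, i \rangle \stackrel{\alpha_{z\downarrow y\uparrow}}{\mapsto} \langle y\uparrow, i \rangle, \\
& \langle x\uparrow, i \rangle \stackrel{\alpha_{x\uparrow z\uparrow}}{\mapsto} \langle z\uparrow, i \rangle, &
& \langle y\uparrow, i \rangle \stackrel{\alpha_{y\uparrow z\uparrow}}{\mapsto} \langle z\uparrow, i \rangle, \\
& \langle z\uparrow, i \rangle \stackrel{\alpha_{z\uparrow x\downarrow}}{\mapsto} \langle x\downarrow, i \rangle, &
& \langle z\uparrow, i \rangle \stackrel{\alpha_{z\uparrow y\downarrow}}{\mapsto} \langle y\downarrow, i \rangle \;\}
\end{aligned}$$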

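Events are occurrences of transitions of circuit variables. The event $\langle x\uparrow, i \rangle$ represents
the $i$th occurrence of a transition from x = false to x = true. Similarly,
$\langle x\downarrow, i \rangle$ represents the $i$th occurrence of a transition from x = true to x = false.
The repeated rules correspond to dependencies introduced by the inverters and the
C-element that make up the circuit. Initially, x and y are false and z is true. We
can represent the infinite sets $E$ and $R$ graphically:

[Figure: the unrolled constraint graph. It starts at $\langle z\downarrow, 0 \rangle$, which fans out to $\langle x\uparrow, 0 \rangle$ and $\langle y\uparrow, 0 \rangle$; these join at $\langle z\uparrow, 0 \rangle$, which fans out to $\langle x\downarrow, 0 \rangle$ and $\langle y\downarrow, 0 \rangle$; these join at $\langle z\downarrow, 1 \rangle$, and the pattern repeats.]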


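Notice that event $\langle z\downarrow, 0 \rangle$ has no predecessors. In the timing simulation, $\hat{t}(\langle z\downarrow, 0 \rangle)$
is set to 0. [For ease of notation, $\hat{t}(\langle z\downarrow, 0 \rangle)$ will sometimes be written as $\hat{t}(z\downarrow, 0)$.]
The entire timing simulation $\hat{t}$, which can be constructed by inspection from the
constraint graph, is:

$$\begin{aligned}
\hat{t}(z\downarrow, i) &= p\,i \\
\hat{t}(z\uparrow, i) &= \max(\alpha_{z\downarrow x\uparrow} + \alpha_{x\uparrow z\uparrow},\; \alpha_{z\downarrow y\uparrow} + \alpha_{y\uparrow z\uparrow}) + p\,i \\
\hat{t}(x\uparrow, i) &= \alpha_{z\downarrow x\uparrow} + p\,i \\
\hat{t}(x\downarrow, i) &= \max(\alpha_{z\downarrow x\uparrow} + \alpha_{x\uparrow z\uparrow},\; \alpha_{z\downarrow y\uparrow} + \alpha_{y\uparrow z\uparrow}) + \alpha_{z\uparrow x\downarrow} + p\,i \\
\hat{t}(y\uparrow, i) &= \alpha_{z\downarrow y\uparrow} + p\,i \\
\hat{t}(y\downarrow, i) &= \max(\alpha_{z\downarrow x\uparrow} + \alpha_{x\uparrow z\uparrow},\; \alpha_{z\downarrow y\uparrow} + \alpha_{y\uparrow z\uparrow}) + \alpha_{z\uparrow y\downarrow} + p\,i
\end{aligned}$$

where $p = \max(\alpha_{z\downarrow x\uparrow} + \alpha_{x\uparrow z\uparrow},\; \alpha_{z\downarrow y\uparrow} + \alpha_{y\uparrow z\uparrow}) + \max(\alpha_{z\uparrow x\downarrow} + \alpha_{x\downarrow z\downarrow},\; \alpha_{z\uparrow y\downarrow} + \alpha_{y\downarrow z\downarrow})$. □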
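The closed form above can be checked numerically by unrolling the repetitive system for a few occurrences. The sketch below is illustrative only: the helper `unroll_simulation`, the `"x+"`/`"x-"` spelling of up and down transitions, and the numeric delays are assumptions, and every cycle of the collapsed graph is assumed to carry a positive total offset so that the unrolled graph is acyclic.

```python
def unroll_simulation(transitions, rules, n_occurrences):
    """Timing simulation of the first n_occurrences of every transition.

    rules: list of (u, v, delay, offset), read as <u, i-offset> -> <v, i>.
    A rule is instantiated only for i >= max(0, offset), as in Section 2.1.
    """
    times = {}

    def t_hat(v, i):
        if (v, i) not in times:
            candidates = [
                t_hat(u, i - eps) + alpha
                for (u, target, alpha, eps) in rules
                if target == v and i >= max(0, eps)
            ]
            # Events without predecessors are assigned time 0.
            times[(v, i)] = max(candidates, default=0.0)
        return times[(v, i)]

    for i in range(n_occurrences):
        for v in transitions:
            t_hat(v, i)
    return times


# C-element of Example 2.3 with assumed delays ("x+" is x up, "x-" is x down).
C_ELEMENT_RULES = [
    ("x-", "z-", 1.0, 1), ("y-", "z-", 4.0, 1),
    ("z-", "x+", 1.0, 0), ("z-", "y+", 2.0, 0),
    ("x+", "z+", 3.0, 0), ("y+", "z+", 1.0, 0),
    ("z+", "x-", 2.0, 0), ("z+", "y-", 1.0, 0),
]
TRANSITIONS = ["x+", "x-", "y+", "y-", "z+", "z-"]
times = unroll_simulation(TRANSITIONS, C_ELEMENT_RULES, 5)
period = times[("z-", 1)] - times[("z-", 0)]
```

For these delays the formula for $p$ gives $\max(1+3, 2+1) + \max(2+1, 1+4) = 9$, and the unrolled simulation advances by the same amount per occurrence.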


2.2  Linear  Timing  Functions 

In  the  previous  example,  we  saw  that  the  timing  simulation  of  a  repetitive 
ER  system  took  on  a  simple  form  that  is  linear  in  the  occurrence  index  i. 
This  is  not  the  case  for  all  repetitive  ER  systems.  However,  as  is  shown  later 
in Theorems 2.3 and 2.4, a linear timing function exists whenever the timing
simulation exists, and the “best” such function will be a good approximation
of  the  timing  simulation. 

We call $t \in T$ a linear timing function of $(E', R')$ if

$$t(v, i) = x_v + p_v i \quad \text{for every } v \in E' \text{ and } i \in \mathbb{N}. \tag{4}$$

Each $x_v$ and $p_v$ is independent of $i$. For each $v \in E'$, $x_v$ and $p_v$ are called,
respectively, the offset and the cycle period of the transition $v$.

Because of the linear form of $t$, the timing function constraints, (1),
reduce to linear inequalities in the offsets and the cycle periods of the transitions.
All dependence on the occurrence index $i$ can be eliminated. For
each rule $r = \langle u, i - \varepsilon \rangle \stackrel{\alpha}{\mapsto} \langle v, i \rangle \in R'$, we have the infinite set of constraints:

$$t(v, i) \ge t(u, i - \varepsilon) + \alpha, \quad \text{for each } i \ge i_0 = \max(0, \varepsilon).$$

Replacing $t$ by its definition (4), we get

$$x_v + p_v i \ge x_u + p_u (i - \varepsilon) + \alpha.$$

Rearranging terms yields

$$x_v \ge x_u - p_u \varepsilon + \alpha + (p_u - p_v)\, i. \tag{5}$$

Equation (5) can never be satisfied for all $i$ if $p_u > p_v$. Thus, the infinite set
of constraints generated by $r$ can be replaced by the two inequalities,

$$x_v \ge x_u - p_u \varepsilon + \alpha + (p_u - p_v)\, i_0, \quad \text{and} \tag{6}$$

$$p_v \ge p_u. \tag{7}$$

We define the collapsed-constraint graph $G'$ of $(E', R')$ as the directed
graph with nodes from $E'$ and arcs from $R'$. From (7) we see that for
a feasible solution to exist, a partial ordering among the $p_v$'s must be
satisfied. If two nodes, $u$ and $v$, are in the same cycle of $G'$, then $p_u$ must
equal $p_v$. Thus, all transitions in the same strongly-connected component
of $G'$ have the same cycle period. In the following, we consider only those
repetitive ER systems in which $G'$ is strongly connected, and we use $p$ to
denote the cycle period of every element in $E'$. Thus (6) simplifies to

$$x_v \ge x_u + \alpha - \varepsilon p. \tag{8}$$

Each arc of $G'$ is labeled with $\alpha - \varepsilon p$ to signify constraint (8).
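Constraint (8) is easy to check mechanically for a candidate pair of offsets and cycle period. The following sketch assumes a hypothetical two-transition ring (not a circuit from this paper) and a helper name, `is_linear_timing_function`, introduced only for illustration.

```python
def is_linear_timing_function(offsets, p, rules, tol=1e-9):
    """Check constraint (8), x_v >= x_u + alpha - eps * p, for every rule.

    offsets: dict mapping each transition to its offset x_v.
    rules:   list of (u, v, alpha, eps) quadruples.
    """
    return all(
        offsets[v] + tol >= offsets[u] + alpha - eps * p
        for (u, v, alpha, eps) in rules
    )


# Hypothetical ring: <a, i-1> -> <b, i> with delay 3, <b, i> -> <a, i> with delay 2.
RING_RULES = [("a", "b", 3.0, 1), ("b", "a", 2.0, 0)]
offsets = {"b": 0.0, "a": 2.0}

feasible_at_5 = is_linear_timing_function(offsets, 5.0, RING_RULES)
feasible_at_4 = is_linear_timing_function(offsets, 4.0, RING_RULES)
```

The only cycle of this ring has total delay 5 and total offset 1, so a period of 5 is consistent with these offsets while a period of 4 violates the first rule.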

2.3  Minimum-Period  Linear  Timing  Functions 

Among  the  possible  linear  timing  functions,  there  are  those  that  minimize 
the  cycle  period  p.  The  techniques  of  linear  programming  [4]  can  be  used  to 
find such a minimum-period linear timing function.

The constraints of a linear timing function, (8), are simple linear inequalities
in the $x_v$'s and $p$. By ordering the sets $E'$ and $R'$, we can construct a
linear program in matrix form:

$$z = \min\; \mathbf{0}^T \mathbf{x} + 1^T p, \qquad A' \mathbf{x} + \boldsymbol{\varepsilon} p \ge \boldsymbol{\alpha}, \qquad \mathbf{x}, p \ge 0. \tag{9}$$

The matrix $A'$ is the arc-node incidence matrix of the collapsed-constraint
graph $G'$. If row $j$ of $A'$ represents the constraint $r_j \in R'$, and column $k$ of
$A'$ represents the transition $u_k \in E'$, then

$$a_{jk} = \begin{cases} -1 & \text{if } u_k \text{ is the source transition of } r_j \\ \phantom{-}1 & \text{if } u_k \text{ is the target transition of } r_j \\ \phantom{-}0 & \text{otherwise.} \end{cases}$$

The $j$th elements of the (column) vectors $\boldsymbol{\varepsilon}$ and $\boldsymbol{\alpha}$ are the occurrence-index
offset and the delay of constraint $r_j$, respectively.
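As an illustration of how the data of program (9) might be assembled, the sketch below builds $A'$, $\boldsymbol{\varepsilon}$ and $\boldsymbol{\alpha}$ from a rule list. The function name `build_linear_program` and the three-transition ring are hypothetical; only the sign convention for $a_{jk}$ is taken from the text.

```python
def build_linear_program(transitions, rules):
    """Return (A, eps, alpha) for the constraints A x + eps p >= alpha.

    transitions: ordered list of transition names (the columns of A).
    rules:       ordered list of (u, v, alpha, eps) quadruples (the rows of A).
    """
    column = {u: k for k, u in enumerate(transitions)}
    a_matrix, eps_vec, alpha_vec = [], [], []
    for (u, v, alpha, eps) in rules:
        row = [0] * len(transitions)
        row[column[u]] -= 1  # source transition of the rule
        row[column[v]] += 1  # target transition of the rule
        a_matrix.append(row)
        eps_vec.append(eps)
        alpha_vec.append(alpha)
    return a_matrix, eps_vec, alpha_vec


# Hypothetical three-transition ring used only to exercise the construction.
RING = [("a", "b", 3.0, 1), ("b", "c", 2.0, 0), ("c", "a", 1.0, 0)]
A, eps, alpha = build_linear_program(["a", "b", "c"], RING)
```

Every row of the resulting matrix sums to zero, which is the property $A'\mathbf{1} = \mathbf{0}$ of an arc-node incidence matrix.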

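Example 2.4 Consider the system $(E', R')$ corresponding to a lazy-active/passive
buffer (Figure 1) connected to the trivial environment:

$$\begin{aligned}
R' = \{\; & \langle li\downarrow, i-1 \rangle \stackrel{\alpha_{lo\uparrow}}{\mapsto} \langle lo\uparrow, i \rangle, &
& \langle ro\downarrow, i-1 \rangle \stackrel{\alpha_{lo\uparrow}}{\mapsto} \langle lo\uparrow, i \rangle, \\
& \langle li\uparrow, i \rangle \stackrel{\alpha_{lo\downarrow}}{\mapsto} \langle lo\downarrow, i \rangle, &
& \langle ri\uparrow, i \rangle \stackrel{\alpha_{ro\uparrow}}{\mapsto} \langle ro\uparrow, i \rangle, \\
& \langle lo\downarrow, i \rangle \stackrel{\alpha_{ro\uparrow}}{\mapsto} \langle ro\uparrow, i \rangle, &
& \langle ri\downarrow, i \rangle \stackrel{\alpha_{ro\downarrow}}{\mapsto} \langle ro\downarrow, i \rangle, \\
& \langle lo\uparrow, i \rangle \stackrel{\alpha_{li\uparrow}}{\mapsto} \langle li\uparrow, i \rangle, &
& \langle lo\downarrow, i \rangle \stackrel{\alpha_{li\downarrow}}{\mapsto} \langle li\downarrow, i \rangle, \\
& \langle ro\downarrow, i-1 \rangle \stackrel{\alpha_{ri\uparrow}}{\mapsto} \langle ri\uparrow, i \rangle, &
& \langle ro\uparrow, i \rangle \stackrel{\alpha_{ri\downarrow}}{\mapsto} \langle ri\downarrow, i \rangle \;\}
\end{aligned}$$

For this example, $A' \mathbf{x} + \boldsymbol{\varepsilon} p \ge \boldsymbol{\alpha}$ is:

$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & -1 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \\
-1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} x_{lo\uparrow} \\ x_{li\uparrow} \\ x_{ro\uparrow} \\ x_{ri\uparrow} \\ x_{lo\downarrow} \\ x_{li\downarrow} \\ x_{ro\downarrow} \\ x_{ri\downarrow} \end{pmatrix}
+
\begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{pmatrix} p
\;\ge\;
\begin{pmatrix} \alpha_{lo\uparrow} \\ \alpha_{lo\uparrow} \\ \alpha_{lo\downarrow} \\ \alpha_{ro\uparrow} \\ \alpha_{ro\uparrow} \\ \alpha_{ro\downarrow} \\ \alpha_{li\uparrow} \\ \alpha_{li\downarrow} \\ \alpha_{ri\uparrow} \\ \alpha_{ri\downarrow} \end{pmatrix}$$

□

The duality theorem of linear programming relates the primal program
(9) to the dual program:

$$w = \max\; \mathbf{y}^T \boldsymbol{\alpha}, \qquad \mathbf{y}^T A' \le \mathbf{0}^T, \qquad \mathbf{y}^T \boldsymbol{\varepsilon} \le 1, \qquad \mathbf{y} \ge 0. \tag{10}$$

If both the primal and the dual programs have optimal solutions, then the
optimal value $z$ of the primal equals the optimal value $w$ of the dual. Since
$A'$ is an arc-node incidence matrix, and thus $A' \mathbf{1} = \mathbf{0}$, any $\mathbf{y}$ feasible for (10)
also satisfies $\mathbf{y}^T A' = \mathbf{0}^T$. Thus, to determine the optimal value $w$, we need
only solve the simplified dual program:

$$w = \max\; \mathbf{y}^T \boldsymbol{\alpha}, \qquad \mathbf{y}^T A' = \mathbf{0}^T, \qquad \mathbf{y}^T \boldsymbol{\varepsilon} \le 1, \qquad \mathbf{y} \ge 0. \tag{11}$$

2.4 Cycle Vectors of a Graph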

A cycle $c$ of length $l$ in a directed graph, $\mathcal{G} = (\mathcal{N}, \mathcal{A})$, is an ordered subset
$(a_0, a_1, \ldots, a_{l-1})$ of the arcs $\mathcal{A}$ such that $\mathit{target}(a_{k-1}) = \mathit{source}(a_k)$ for all
$0 < k < l$ and $\mathit{target}(a_{l-1}) = \mathit{source}(a_0)$. The cycle $c$ can be represented by
a cycle vector $u$, a $\{0, 1\}$-vector of length $|\mathcal{A}|$, where $u_j = 1$ if and only if the
$j$th arc of $\mathcal{A}$ is in the set $c$. For each cycle vector $u$, $u^T A' = \mathbf{0}^T$, where $A'$ is
the arc-node incidence matrix of the graph $\mathcal{G}$. A cycle is simple if each node
in the cycle has one incoming and one outgoing cycle arc. The following
lemma relates the simple-cycle vectors to an arbitrary vector $\mathbf{y}$ satisfying
$\mathbf{y}^T A' = \mathbf{0}^T$.


Lemma 2.2 Let $U_i$, $0 \le i < q$, denote the simple-cycle vectors of a graph
with arc-node incidence matrix $A'$. Then, if $\mathbf{y} \ge 0$ is such that $\mathbf{y}^T A' = \mathbf{0}^T$,
there exist scalars $\theta_i \ge 0$, $0 \le i < q$, such that

$$\mathbf{y} = \theta_0 U_0 + \theta_1 U_1 + \cdots + \theta_{q-1} U_{q-1}. \tag{12}$$

Proof: See [1] for complete proof. The proof uses induction on the number
of simple cycles in the graph of $A'$. ∎

This lemma provides a straightforward means of determining the minimum
cycle period $p$. By enumerating every simple cycle in the collapsed-constraint
graph, and computing the sum of the delays and the sum of the
occurrence-index offsets around each cycle, we can find $p$.

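Theorem 2.3 The minimum cycle period, or equivalently, the optimal value
of the dual program (11), is

$$\max \left\{ \frac{U_k^T \boldsymbol{\alpha}}{U_k^T \boldsymbol{\varepsilon}} \;\middle|\; \text{for all simple-cycle vectors } U_k \right\}. \tag{13}$$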


Proof: Let $U$ be the simple-cycle matrix constructed by concatenating the
(column) simple-cycle vectors $U_0, U_1, \ldots, U_{q-1}$. By construction, $U^T A' = 0$.
By Lemma 2.2, any $\mathbf{y} \ge 0$ with $\mathbf{y}^T A' = \mathbf{0}^T$ can be represented as the product
$U\Theta$, where the vector $\Theta$ has non-negative elements. The dual program (11)
reduces to:

$$w = \max\; \Theta^T (U^T \boldsymbol{\alpha}), \qquad \Theta^T (U^T \boldsymbol{\varepsilon}) \le 1, \qquad \Theta \ge 0.$$

The dual of the reduced dual is easily solved:

$$z = \min\; \lambda, \qquad (U^T \boldsymbol{\varepsilon})\, \lambda \ge (U^T \boldsymbol{\alpha}), \qquad \lambda \ge 0. \tag{14}$$

The smallest scalar $\lambda$ that satisfies the vector inequality in (14) yields the
desired minimum cycle period. ∎


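Example 2.5 The minimum cycle period of the previous example can be determined
by a cycle-period analysis. Two views of the collapsed-constraint graph
are:

[Figure: two views of the collapsed-constraint graph of Example 2.4. In the left view each arc carries its label $\alpha - \varepsilon p$; in the right view each arc carries its arc number.]

The view on the left uses the standard labeling; on the right, the labels denote
the numbered arcs. The three cycles through the graph can be represented by the
simple-cycle matrix:

$$U^T = \begin{pmatrix}
0 & 1 & 1 & 0 & 1 & 1 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 1
\end{pmatrix}$$

The vector inequality $(U^T \boldsymbol{\varepsilon})\, \lambda \ge (U^T \boldsymbol{\alpha})$ becomes:

$$\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \lambda \;\ge\;
\begin{pmatrix}
\alpha_{lo\uparrow} + \alpha_{lo\downarrow} + \alpha_{ro\uparrow} + \alpha_{ro\downarrow} + \alpha_{li\uparrow} + \alpha_{ri\downarrow} \\
\alpha_{lo\uparrow} + \alpha_{lo\downarrow} + \alpha_{li\uparrow} + \alpha_{li\downarrow} \\
\alpha_{ro\uparrow} + \alpha_{ro\downarrow} + \alpha_{ri\uparrow} + \alpha_{ri\downarrow}
\end{pmatrix}$$

Thus, the minimum cycle period is the largest of the three sums on the right-hand side. □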
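The cycle-period analysis of this example can be reproduced by brute-force enumeration of simple cycles, as suggested by Theorem 2.3. The sketch below is an illustration under stated assumptions: the rule list is our reading of Example 2.4 (delays named by target transition), the numeric delays are invented, the helper names are hypothetical, and every cycle is assumed to have a positive total offset.

```python
def simple_cycles(n_nodes, arcs):
    """Yield every simple cycle as a list of arc indices.

    arcs: list of (source, target) node indices. Each cycle is reported
    once, rooted at its lowest-numbered node.
    """
    out = [[] for _ in range(n_nodes)]
    for j, (u, v) in enumerate(arcs):
        out[u].append((j, v))

    def extend(start, node, path, visited):
        for j, nxt in out[node]:
            if nxt == start:
                yield path + [j]
            elif nxt > start and nxt not in visited:
                yield from extend(start, nxt, path + [j], visited | {nxt})

    for start in range(n_nodes):
        yield from extend(start, start, [], {start})


def min_cycle_period(transitions, rules):
    """Evaluate (13): the largest delay-to-offset ratio over simple cycles."""
    index = {u: k for k, u in enumerate(transitions)}
    arcs = [(index[u], index[v]) for (u, v, _, _) in rules]
    ratios = []
    for cycle in simple_cycles(len(transitions), arcs):
        delay = sum(rules[j][2] for j in cycle)
        offset = sum(rules[j][3] for j in cycle)  # assumed positive
        ratios.append(delay / offset)
    return max(ratios), len(ratios)


# Assumed delays, one per target transition ("lo+" is lo up, "lo-" is lo down).
D = {"lo+": 2.0, "lo-": 1.0, "ro+": 3.0, "ro-": 1.0,
     "li+": 1.0, "li-": 1.0, "ri+": 2.0, "ri-": 1.0}
LAP_RULES = [(src, tgt, D[tgt], eps) for (src, tgt, eps) in [
    ("li-", "lo+", 1), ("ro-", "lo+", 1), ("li+", "lo-", 0), ("ri+", "ro+", 0),
    ("lo-", "ro+", 0), ("ri-", "ro-", 0), ("lo+", "li+", 0), ("lo-", "li-", 0),
    ("ro-", "ri+", 1), ("ro+", "ri-", 0)]]
TRANSITIONS = ["lo+", "li+", "ro+", "ri+", "lo-", "li-", "ro-", "ri-"]
period, n_cycles = min_cycle_period(TRANSITIONS, LAP_RULES)
```

With these assumed delays the three cycle sums are 9, 5 and 7, each with total offset 1, so the enumeration reports three cycles and a minimum cycle period of 9.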


2.5  Efficient  Computation  of  the  Minimum  Cycle  Period 

Enumeration  of  all  simple  cycles  of  a  graph  does  not  lead  to  an  efficient 
procedure  to  compute  the  cycle  period  because  an  arbitrary  graph  may  have 
exponentially  more  simple  cycles  than  nodes  or  arcs. 

Lawler in [5] provides an $O(|\mathcal{N}|\,|\mathcal{A}| \log B)$ solution to the minimal cost-to-time
ratio cycle problem, which is equivalent to (13). Given a candidate
cycle period that is too short and one that is too long, the algorithm
uses binary search and a procedure to test, for a particular candidate cycle
period, whether there is a negative cycle in the graph, in order to determine
the minimum cycle period in $\log B$ steps. ($B$ is related to the desired precision
of the result.)
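The binary-search idea can be sketched in a few lines. This is an illustrative reconstruction of the general technique, not Lawler's published procedure: it labels each arc with $\alpha - \varepsilon p$ as in (8) and uses a Bellman-Ford style longest-path relaxation to detect a positive cycle, which is the negative-cycle test with all weights negated. The function names, the tolerance, and the small example graph are assumptions.

```python
def feasible(n_nodes, arcs, p, tol=1e-12):
    """True if offsets satisfying x_v >= x_u + alpha - eps * p exist.

    arcs: list of (u, v, alpha, eps) with integer node indices.
    Relaxation that is still changing after n_nodes passes indicates a
    cycle of positive total weight, so the candidate period is too short.
    """
    x = [0.0] * n_nodes
    for _ in range(n_nodes):
        changed = False
        for (u, v, alpha, eps) in arcs:
            weight = alpha - eps * p
            if x[u] + weight > x[v] + tol:
                x[v] = x[u] + weight
                changed = True
        if not changed:
            return True
    return False


def min_cycle_period(n_nodes, arcs, too_short, too_long, precision=1e-9):
    """Binary search between an infeasible and a feasible candidate period."""
    lo, hi = too_short, too_long
    while hi - lo > precision:
        mid = (lo + hi) / 2.0
        if feasible(n_nodes, arcs, mid):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical graph: cycle 0->1->2->0 has ratio 6/1, cycle 0->1->0 has ratio 7/2.
ARCS = [(0, 1, 3.0, 0), (1, 2, 2.0, 0), (2, 0, 1.0, 1), (1, 0, 4.0, 2)]
p_min = min_cycle_period(3, ARCS, too_short=0.0, too_long=10.0)
```

Each feasibility test costs one pass over the arcs per node, and the number of tests grows with the logarithm of the ratio between the initial interval and the requested precision.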

In [1], we provide an $O(|\mathcal{N}|\,|\mathcal{A}|)$ algorithm to determine the minimum cycle
period when the sum of the $\varepsilon$ values around each simple cycle is bounded.
This algorithm is based on a direct solution to the linear program (11) and
its corresponding dual and uses a customization of the general primal-dual
algorithm [12].


2.6  Case  Study:  Comparison  of  Two  FIFOs 

We  now  apply  the  performance  analysis  techniques  of  Section  2  to  compare 
the  performance  of  two  implementations  of  a  first-in/first-out  (FIFO)  queue. 
While it is possible to extend repetitive ER systems to allow the analysis of
an unbounded linear array of identical processes [1], we instead perform the
analysis  directly  on  small  arrays.  We  leave  it  as  an  exercise  to  the  reader  to 
show  that,  in  these  two  cases,  instantiating  additional  processes  in  the  array 
does  not  increase  the  cycle  period. 

Three stages of a four-phase lazy-active/passive (lap) FIFO (Example 2.4)
with the datapath between the stages are shown in Figure 1. For a circuit-level
implementation of a lap stage, see [1] or [10]. The three critical
cycles through the transitions of the middle process are represented by
bold arcs in the collapsed-constraint graphs (Figure 2). Assuming all delays
in the circuit are small compared to the datapath delay, we get that
$p = [\ldots] + \max(\alpha_{D\uparrow}, \alpha_{D\downarrow})$.

Two  stages  of  Ivan  Sutherland’s  two-phase  FIFO  ([16],  Figure  16)  can  be 
described  by  the  circuit  and  collapsed-constraint  graph  shown  in  Figure  3. 
Since this circuit is symmetric in up and down transitions, we write, for
example, $li$ for both $li\uparrow$ and $li\downarrow$. The bold arcs in the graph represent the
critical cycle. Assuming all delays in the circuit are small compared to the
datapath delay, we get that $p = 2\alpha_D$.

A  more  complete  analysis  is  provided  in  [1]  which  compares  several  other 
designs  for  FIFOs.  This  example  shows  that  the  best  existing  two-phase  and 
four-phase  implementations  of  a  FIFO  have  comparable  cycle  periods.  This 
result  validates  our  long-standing  beliefs  that  four-phase  implementations 
are  as  fast  as  two-phase  implementations,  and  that  because  of  their  simplicity 
and  generality,  they  offer  a  better  discipline  in  which  to  design  asynchronous 
circuits. 

2.7  Approximating  the  Timing  Simulation 

We  now  show  that  a  minimum-period  linear  timing  function  provides  an 
accurate  approximation  to  the  timing  simulation. 

Theorem 2.4 Let $t$ and $\hat{t}$ be a minimum-period linear timing function
and the timing simulation, respectively, of the connected repetitive system
$(E', R')$. There exists a finite $B$ such that for all $u \in E'$ and all $i \ge 0$

$$s_{u,i} = t(u, i) - \hat{t}(u, i) \le B.$$

Proof: By definition for each $u$ and $i$

$$t(u, i) = x_u + p\,i$$

$$\hat{t}(u, i) = x_u + p\,i - s_{u,i}.$$

Each $s_{u,i}$ is non-negative because $\hat{t}$ is the smallest timing function. For
the constraints generated from $r = \langle u, i - \varepsilon \rangle \stackrel{\alpha}{\mapsto} \langle v, i \rangle \in R'$, we define the
non-negative slack variables, $z_{r,i}$ and $z_r$, thus transforming inequalities into
equalities:

$$x_u - p\varepsilon + \alpha + z_r = x_v \tag{15}$$

$$x_u - p\varepsilon - s_{u,i-\varepsilon} + \alpha + z_{r,i} = x_v - s_{v,i} \tag{16}$$

$\mathit{lap} = *[\,lo\uparrow;\; [li];\; lo\downarrow;\; [ri];\; ro\uparrow;\; [\neg ri];\; ro\downarrow;\; [\neg li]\,]$


Figure 1: Three stages of a four-phase lap FIFO. The cigar-shaped
objects represent the datapaths. The dashed lines
denote the flow of data in the FIFO.


Figure 2: Critical cycles through the graph of a three-stage
lap FIFO. The two graphs are identical; the sets of
connected, bold arcs represent the critical cycles. An arc
with a tick mark has $\varepsilon = 1$. All other arcs have $\varepsilon = 0$.


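Figure 3: Circuit and graph of Sutherland’s two-phase
FIFO. The operators marked C are Muller C-elements.

By subtracting these equations and simplifying, we get

$$z_r - z_{r,i} = s_{v,i} - s_{u,i-\varepsilon}. \tag{17}$$

From Theorem 2.3, $p \sum_{r \in c} \varepsilon_r = \sum_{r \in c} \alpha_r$ for at least one cycle $c$. Adding the
constraints on $t$, (15), for each $r \in c$, we see that

$$\sum_{r \in c} x_{u_r} - p \sum_{r \in c} \varepsilon_r + \sum_{r \in c} \alpha_r + \sum_{r \in c} z_r = \sum_{r \in c} x_{v_r}.$$

Since along any cycle $\sum_{r \in c} x_{u_r} = \sum_{r \in c} x_{v_r}$, we have for all $r \in c$,

$$z_r = 0.$$

By (17), $s_{u,i-\varepsilon} \ge s_{v,i}$ for all $i \ge \max(0, \varepsilon)$ and all $u$, $v$ on cycle $c$. By summing
along the cycle $c$, we see that for each $u \in c$ and $i' \ge 0$

$$s_{u,i'} \ge s_{u,i} \quad \text{where } i = i' + \sum_{r \in c} \varepsilon_r.$$

Therefore, we can bound $s_{u,i}$, for every $u \in c$, by

$$B' = \max \Big\{ s_{u,i} \;\Big|\; u \in c \,\wedge\, i < \sum_{r \in c} \varepsilon_r \Big\}.$$

For any transition $v$ not on cycle $c$, we find a path $P_v$ to transition $v$ from
a transition $u$ on $c$. Because $G'$ is strongly connected, such a path must exist
and be independent of $i$. Then, by summing (17) along that path, we get
for all $i, i' \ge 0$

$$s_{u,i'} + \sum_{r \in P_v} z_r \ge s_{v,i},$$

where $i = i' + \sum_{r \in P_v} \varepsilon_r$. But $\sum_{r \in P_v} z_r$ is independent of $i$; thus, $s_{v,i}$ is
bounded by a quantity that does not increase with successive occurrences.
Thus, every $s_{v,i}$ with $v \notin c$ is bounded by $B$ where

$$B \ge \max \Big\{ s_{v,i} \;\Big|\; v \notin c \,\wedge\, i < \sum_{r \in P_v} \varepsilon_r \Big\} \quad \text{and} \quad B \ge B' + \max_{v \notin c} \sum_{r \in P_v} z_r.$$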


3  Performance  Optimization 

Using the above analysis method, a performance metric (the minimum cycle
period $p$) can be expressed in terms of the primitive delays of an ER system.
These delays can be estimated from the sizes of the transistors that
make  up  the  operators  of  the  circuit  and  from  the  way  these  operators  are 
interconnected,  by  using  a  simple  resistance-capacitance  (RC)  timing  model. 
Composing  the  performance  metric  in  terms  of  component  delays  with  the 
delay  approximation  of  the  operators  in  terms  of  transistor  sizes,  we  get  an 
expression  for  the  performance  of  the  system  in  terms  of  transistor  sizes. 
This expression is minimized, producing optimal sizes for the transistors.

Figure 4: Linear approximation of a CMOS pulldown.

3.1  Tau  Model 

A simple RC switch model is used to relate each individual delay $\alpha$ to the
various widths ($w$'s) of the circuit's transistors. Each transistor is modeled
as a switch with a resistance inversely proportional to its width. The gate of
each transistor has a capacitance to ground proportional to its width. Source
and drain capacitances are also proportional to transistor widths. Thus, the
delays between $a\uparrow$ and $z\downarrow$, and $b\uparrow$ and $z\downarrow$, of the circuit shown in Figure 4
are modeled as:

$$\alpha_{a\uparrow z\downarrow} = R_1 C_1 + (R_1 + R_2)\, C_2, \qquad \alpha_{b\uparrow z\downarrow} = (R_1 + R_2)\, C_2,$$

$$R_1 = \mu / w_1, \qquad R_2 = \mu / w_2,$$

$$C_1 = K_{int}\,(w_1 + w_2), \qquad C_2 = K_{ext}\,(w_2 + w_3) + C_{wiring} + K_g\,(w_4 + w_5),$$

where $\mu$ is a constant that describes the differing per-unit-width strengths
of the n- and p-channel transistors, $K_{int}$ is the per-unit-width capacitance
contributed by internal (to the series chain) drain and source terminals, $K_{ext}$
is the per-unit-width capacitance contributed by external (the output node)
drain terminals, $K_g$ is the per-unit-width gate capacitance, and $C_{wiring}$ is the
capacitance contributed by wiring. All capacitances are expressed in terms
of transistor width; thus $K_g = 1$. Each delay $\alpha$ is expressed in units of $\tau$,
the time needed for a unit-width n-channel transistor to switch a unit-width
load. (Thus, $\mu_n = 1$, $\mu_p \ge 1$.) The values of $K_{int}$, $K_{ext}$ and $C_{wiring}$ are not
constant, but depend on the final circuit layout, which depends weakly on the
transistor widths. This dependence is normally small and is ignored in the
optimization problem.
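To make the tau model concrete, the sketch below evaluates the two pulldown delays of Figure 4 for given widths. The function name `pulldown_delays` and the default constants are assumptions for illustration; the formulas are the ones stated above, with the delay for input $a$ including the extra term for discharging the internal node.

```python
def pulldown_delays(w1, w2, w3, w4, w5, mu=1.0,
                    k_int=0.5, k_ext=1.0, k_g=1.0, c_wiring=0.0):
    """Delays (in units of tau) from a-up and b-up to z-down.

    w1, w2: widths of the two series pulldown transistors.
    w3:     width of the device sharing the output node.
    w4, w5: widths of the transistor gates driven by z.
    The default constants are illustrative assumptions.
    """
    r1, r2 = mu / w1, mu / w2
    c1 = k_int * (w1 + w2)                               # internal node
    c2 = k_ext * (w2 + w3) + c_wiring + k_g * (w4 + w5)  # output node
    delay_a = r1 * c1 + (r1 + r2) * c2
    delay_b = (r1 + r2) * c2
    return delay_a, delay_b


delay_a, delay_b = pulldown_delays(1.0, 1.0, 1.0, 1.0, 1.0)
```

Widening a series transistor lowers its resistance but raises the capacitance it contributes, which is the trade-off that the sizing optimization has to balance.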

Example 3.1 As an example, we will construct the optimization equations for
the C-element circuit of Example 2.3. For purposes of this example, the constants
of the tau model take on these values:

$$K_{ext} = 1, \quad K_{int} = 0.5, \quad K_g = 1, \quad \mu_p = \mu_n = 1, \quad C_{wiring} = 0.$$

Since the mobilities of the pull-up and pull-down devices are assumed to be identical,
by symmetry the widths of the pull-up and pull-down devices are identical as are
the pull-up and pull-down delays. By removing the transition-direction reference
from the delay names (for example, $\alpha_{z\downarrow x\uparrow}$ is named $\alpha_{zx}$), and expanding the delays
between transitions on x and y, and transitions on z (by introducing the new variable
u), the expression for the cycle period becomes

$$p = 2\,(\alpha_{uz} + \max(\alpha_{zx} + \alpha_{xu},\; \alpha_{zy} + \alpha_{yu})).$$

The circuit and corresponding $\alpha$ values are shown below:

[Figure: the C-element circuit with transistor widths $w_0$ to $w_4$.]

$$\begin{aligned}
\alpha_{uz} &= \frac{1}{w_0}\,(2w_0 + 2w_1 + 2w_2) \\
\alpha_{zx} &= \frac{1}{w_1}\,(2w_1 + 2w_4) \\
\alpha_{zy} &= \frac{1}{w_2}\,(2w_2 + 2w_3) \\
\alpha_{xu} &= \left(\frac{1}{w_3} + \frac{1}{w_4}\right)(2w_4 + 2w_0) \\
\alpha_{yu} &= \frac{1}{2w_3}\,(w_3 + w_4) + \left(\frac{1}{w_3} + \frac{1}{w_4}\right)(2w_4 + 2w_0)
\end{aligned}$$

□

3.2  Convex  Objective  Function 

Every $\alpha$ derived using this simple model (and also other more accurate ones)
is a posynomial function (polynomial with positive coefficients and positive
variables) of the transistor widths ($w$'s), and thus a convex function of the
$\log w$'s [4]. Because both the sum and the maximum of two convex functions
are convex functions, the resulting expression for $p$ is a convex function of
the $\log w$'s; and, thus, each minimum of $p$ is global. The addition of convex
constraints, for example, to limit power consumption or to bound transistor
sizes, does not alter the unique minimum property.
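The convexity claim can be probed numerically with a midpoint test in the logarithms of the widths. The sketch below uses a hypothetical two-stage objective built from posynomial delays (it is not a circuit from this paper), and the names `period`, `period_of_logs` and `midpoint_convex` are introduced only for this illustration. A passing test is consistent with convexity but does not prove it.

```python
import math
import random


def period(w1, w2, load=4.0):
    """Hypothetical objective: sums and a maximum of posynomial delays."""
    stage1 = (2 * w1 + 2 * w2) / w1   # first stage drives itself and stage 2
    stage2 = (2 * w2 + load) / w2     # second stage drives a fixed load
    return stage1 + max(stage1, stage2)


def period_of_logs(y1, y2):
    """The same objective as a function of y = log w."""
    return period(math.exp(y1), math.exp(y2))


def midpoint_convex(f, trials=1000, seed=0):
    """Check f((a + b) / 2) <= (f(a) + f(b)) / 2 on random point pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = (rng.uniform(-3, 3), rng.uniform(-3, 3))
        b = (rng.uniform(-3, 3), rng.uniform(-3, 3))
        mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
        if f(*mid) > (f(*a) + f(*b)) / 2 + 1e-9:
            return False
    return True


convex_in_logs = midpoint_convex(period_of_logs)
```

The change of variables matters: each term such as $w_2 / w_1$ becomes $e^{y_2 - y_1}$, an exponential of an affine function, which is convex in $y$.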

Example 3.2 The optimal value for the cycle period of Example 3.1 occurs when
the width values are a positive scalar multiple of

$$(w_0, w_1, w_2, w_3, w_4) = (1.0,\; 0.4782,\; 1.1632,\; 1.8002,\; 0.9209).$$

(See [1] for an explanation of how to achieve a unique minimum point by constraining
the power consumption and/or the largest and smallest transistor width.) □
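The reported widths can be substituted back into the cycle-period expression as a sanity check. The sketch below assumes the delay formulas of Example 3.1 as we read them under the tau model with $K_{ext} = 1$, $K_{int} = 0.5$, $K_g = 1$ and $C_{wiring} = 0$; the function name `cycle_period` is hypothetical.

```python
def cycle_period(w0, w1, w2, w3, w4):
    """Return (p, x_branch, y_branch) for the C-element of Example 3.1.

    The formulas are our reading of Example 3.1 and should be treated
    as assumptions rather than as the authors' exact expressions.
    """
    a_uz = (2 * w0 + 2 * w1 + 2 * w2) / w0
    a_zx = (2 * w1 + 2 * w4) / w1
    a_zy = (2 * w2 + 2 * w3) / w2
    output_term = (1 / w3 + 1 / w4) * (2 * w4 + 2 * w0)
    a_xu = output_term
    a_yu = (w3 + w4) / (2 * w3) + output_term
    x_branch = a_zx + a_xu
    y_branch = a_zy + a_yu
    return 2 * (a_uz + max(x_branch, y_branch)), x_branch, y_branch


WIDTHS = (1.0, 0.4782, 1.1632, 1.8002, 0.9209)
p, x_branch, y_branch = cycle_period(*WIDTHS)
p_scaled, _, _ = cycle_period(*(3.0 * w for w in WIDTHS))
```

At these widths the two arguments of the maximum agree to within the rounding of the published digits, as one would expect at the minimum of a maximum of two competing paths, and scaling all widths by a common factor leaves $p$ unchanged because $C_{wiring} = 0$.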

We  have  implemented  a  CAD  tool  for  solving  the  resulting  non-linear, 
non-differentiable,  convex  optimization  problems  based  on  the  subgradient 
techniques  described  by  Shor  [15].  Table  1  lists  the  results  of  this  program 
when  applied  to  a  variety  of  circuits.  The  column  ntrans  denotes  the  number 
of  transistors  in  the  circuit,  and  thus  the  number  of  free  variables  in  the 
optimization problem. The columns $p_{unsized}$ and $p_{sized}$ show the cycle period
in units of $\tau$ of the circuit before and after optimization. In the unsized case,
all n-channel transistors have equal sizes, and each p-channel transistor is
$\sqrt{\mu_p}$ times (optimal for an inverter ring) wider than the n-channel transistors.
The % imp column shows the percent improvement in $p$ provided by
the optimization. The CPU column denotes the number of user seconds
needed to compute the optimum value on a SUN/Sparcstation 1. A direct
implementation of the optimization algorithm requires $O(n_{trans} + n_{cycles}\, l_{max})$
arithmetic operations per iteration, where $n_{cycles}$ is the number of cycles used
to form the cycle-period function and $l_{max}$ is the maximum number of arcs
per cycle. A more sophisticated implementation (results denoted by *) that
uses the primal-dual algorithm of Section 2.5 to determine the cycle period
at each iteration requires only $O(n_{trans}^2)$ arithmetic operations per iteration.

Circuit                          n_trans  p_unsized  p_sized  % imp    CPU
Three inverters                        6         42       42      0    1.4
C-element and two inverters           10         66       61      8    2.8
One stage lap                         21        143      114     25    9.3
Three stage lap                       59        189      143     32     49
Three stage lap*                      59        189      143     32     25
Ten stage lap                        192        189      151     25    279
Ten stage lap*                       192        189      151     25    189
Simple microprocessor control*       285        646      430     50    369

Table 1: Performance of CAD tool.


4  Summary 

We  have  presented  a  method  for  determining  the  performance  of  circuits 
described by event-rule systems. Furthermore, we have shown how to optimally
size transistors in such circuits. What we have not shown, due to lack
of  space,  is  how  to  transform  the  specifications  of  asynchronous  circuits  that 
we  use  for  synthesis  into  ER  systems.  With  the  addition  of  these  techniques, 
we  have  a  complete  method  combining  synthesis  and  performance  analysis. 
The performance analysis can be done early and at each level of the synthesis
procedure and can be used to guide the synthesis of efficient circuits. A
complete description is given in [1].